Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tune] Enable experiment restore from moved cloud uri #31669

Merged
merged 26 commits into from
Jan 24, 2023

Conversation

justinvyu
Copy link
Contributor

Why are these changes needed?

This PR fixes experiment restoration from a different cloud URI to save future results and checkpoints to new URI instead of continuing to write to the old location. The workflow of starting a local experiment, uploading the experiment dir to cloud, then restoring from the URI from a different cluster is also possible now.

Example:

# initial experiment experiment directory at "s3://original/exp_name"
# experiment directory moved to "s3://new/new_exp_name"
Tuner.restore("s3://new/new_exp_name")
# new results/checkpoints should get synced to the new URI "s3://new/new_exp_name"

This is a follow-up to #29920, which enabled experiment restoration from a moved local experiment directory.

I also took the chance to simplify some of the trial loading/serialization logic.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Fix constructor args

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Fix all references to _checkpoint_dir

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…e ckpt dir

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@justinvyu justinvyu added tune Tune-related issues air labels Jan 13, 2023
# since the dir might not be creatable locally.
# TODO(ekl) this is kind of a hack.
if not ray.util.client.ray.is_connected():
trial.init_logdir() # Create logdir if it does not exist
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this out of Trial.__setstate__. Are trials serialized/deserialized in any situation that would require this to stay within __setstate__? My understanding is that trial objects don't get shipped around and stay on the Tune driver. Only serialized/deserialized on experiment checkpoint and restore, which is handled here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for moving this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct @justinvyu and I think your solution is much cleaner. For setstate/getstate, it's ok to just restore exactly the state that was saved. If properties are overwritten, that should happen in the function that issues the restore. No need for magic here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krfricke doesn't this make trials effectively unserializable in the general case unless the trainable is registered? I don't think that's an issue, but perhaps something to consider

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…m trial runner tests

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@justinvyu justinvyu marked this pull request as ready for review January 18, 2023 17:58
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks! Only a couple of nits

# since the dir might not be creatable locally.
# TODO(ekl) this is kind of a hack.
if not ray.util.client.ray.is_connected():
trial.init_logdir() # Create logdir if it does not exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct @justinvyu and I think your solution is much cleaner. For setstate/getstate, it's ok to just restore exactly the state that was saved. If properties are overwritten, that should happen in the function that issues the restore. No need for magic here.

python/ray/tune/tests/test_tuner_restore.py Outdated Show resolved Hide resolved
python/ray/tune/tests/test_tuner_restore.py Outdated Show resolved Hide resolved
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@krfricke krfricke merged commit 47f4da3 into ray-project:master Jan 24, 2023
krfricke added a commit that referenced this pull request Jan 27, 2023
…2010)

#31669 changed the `Trial.__dict__` by moving `local_dir` to `_local_dir`, which resulted in an error in our tune cloud tests. This PR updates the signature of the `TrialStub` class to resolve the issue.

Signed-off-by: Kai Fricke <kai@anyscale.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…y-project#32010)

ray-project#31669 changed the `Trial.__dict__` by moving `local_dir` to `_local_dir`, which resulted in an error in our tune cloud tests. This PR updates the signature of the `TrialStub` class to resolve the issue.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@justinvyu justinvyu deleted the restore_from_moved_cloud_uri branch April 26, 2023 20:13
justinvyu added a commit that referenced this pull request Oct 30, 2023
…th (#40647)

This is a known regression introduced in 2.7: moving the path of the experiment directory and attempting to restore the experiment and/or the experiment results doesn't work due to the absolute paths saved in the trial metadata.

This PR implements a fix similar to #31669 -- replacing the root of the tracked checkpoint paths with the new storage path, and updating on experiment restoration / result loading from a path.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tune Tune-related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants